The following analysis was carried out on the “Mroz87” data set, which contains information on 753 married women in 1975.
library(dplyr)  # needed for the %>% pipe and rename() below
df <- read.csv(file = "/Users/brendarangel/Desktop/Dataset.csv")
df <- df %>%
rename(
EDUCATION = educ,
WAGE = wage,
EXPERIENCE = exper,
CITY = city,
AGE = age,
LFP = lfp,
HOURS = hours,
UNEM = unem,
MTR = mtr,
KIDS_5 = kids5
)
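As a side note, the Mroz87 data also ships with the sampleSelection package, so it could be loaded directly instead of from a local CSV export (a minimal sketch, assuming that package is installed; the same rename() step above would still apply):
# install.packages("sampleSelection")  # if not already installed
library(sampleSelection)
data("Mroz87")
df <- Mroz87  # original lower-case column names (educ, wage, exper, ...)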
# For neatness, we will create another dataframe out of the original just with the required columns.
df_2 <- df[c('EDUCATION', 'WAGE', 'EXPERIENCE','CITY','AGE','LFP','HOURS','KIDS_5')]
# Verify datatypes of columns
str(df_2)
'data.frame': 753 obs. of 8 variables:
$ EDUCATION : int 12 12 12 12 14 12 16 12 12 12 ...
$ WAGE : num 3.35 1.39 4.55 1.1 4.59 ...
$ EXPERIENCE: int 14 5 15 6 7 33 11 35 24 21 ...
$ CITY : int 0 1 0 0 1 1 0 0 0 0 ...
$ AGE : int 32 30 35 34 31 54 37 54 48 39 ...
$ LFP : int 1 1 1 1 1 1 1 1 1 1 ...
$ HOURS : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
$ KIDS_5 : int 1 0 1 0 1 0 0 0 0 0 ...
# Calculate descriptive statistics per column
summary(df_2)
EDUCATION WAGE EXPERIENCE CITY AGE LFP
Min. : 5.00 Min. : 0.000 Min. : 0.00 Min. :0.0000 Min. :30.00 Min. :0.0000
1st Qu.:12.00 1st Qu.: 0.000 1st Qu.: 4.00 1st Qu.:0.0000 1st Qu.:36.00 1st Qu.:0.0000
Median :12.00 Median : 1.625 Median : 9.00 Median :1.0000 Median :43.00 Median :1.0000
Mean :12.29 Mean : 2.375 Mean :10.63 Mean :0.6428 Mean :42.54 Mean :0.5684
3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:49.00 3rd Qu.:1.0000
Max. :17.00 Max. :25.000 Max. :45.00 Max. :1.0000 Max. :60.00 Max. :1.0000
HOURS KIDS_5
Min. : 0.0 Min. :0.0000
1st Qu.: 0.0 1st Qu.:0.0000
Median : 288.0 Median :0.0000
Mean : 740.6 Mean :0.2377
3rd Qu.:1516.0 3rd Qu.:0.0000
Max. :4950.0 Max. :3.0000
# Get general view of variable relationships by plotting
pairs(df_2)
plot(
df[c('EDUCATION','WAGE')],
main = "EDUCATION AND WAGE RELATIONSHIP",
xlab = "EDUCATION",
ylab ="WAGE",
font.lab = 2
)
plot(
df[c('KIDS_5','WAGE')],
main = "KIDS AND WAGE RELATIONSHIP",
xlab = "KIDS",
ylab ="WAGE",
font.lab = 2
)
We can observe a clear relationship in both pairs of variables: the more education the married women had, the higher the wages they earned; and the fewer children under 5 years old they had, the higher their wages tended to be.
First we are going to do an exploratory analysis and check whether the quantitative variables follow a normal distribution or not.
par("mfcol"=c(4, 1))
hist(df$EXPERIENCE, col="blue")
hist(df$EDUCATION, col="green")
hist(df$WAGE, col="red")
hist(df$KIDS_5, col="yellow")
par("mfcol"=c(1, 1))
We can visually observe that the variables EXPERIENCE, WAGE and KIDS_5 are not normally distributed, whereas EDUCATION looks roughly normal, though we cannot be sure from the histogram alone. To be sure, we will run the Shapiro-Wilk test.
shapiro.test(df$EDUCATION)
Shapiro-Wilk normality test
data: df$EDUCATION
W = 0.88634, p-value < 2.2e-16
EDUCATION is not normally distributed either! (p-value < .05)
So none of the variables follow a normal distribution and, in addition, their numeric ranges differ considerably from one another, which could affect the model. We are therefore going to rescale the data. We could take the log of the variables, but since our dataset contains many zeroes, we will use min-max scaling instead.
normalize <- function(x, na.rm = TRUE) {
# Min-max scaling to the [0, 1] range; na.rm is passed through to min() and max()
return((x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}
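As a quick sanity check, applying normalize() to a small toy vector (values chosen purely for illustration) maps the minimum to 0 and the maximum to 1:
normalize(c(5, 12, 17))  # 0.0000000 0.5833333 1.0000000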
par("mfcol"=c(4, 1))
hist(normalize(df$EXPERIENCE), col="blue")
hist(normalize(df$EDUCATION), col="green")
hist(normalize(df$WAGE), col="red")
hist(normalize(df$KIDS_5), col="yellow")
par("mfcol"=c(1, 1))
Min-max scaling is a linear transformation, so the variables are still not normally distributed, but they now all share the same [0, 1] range, which puts the predictors on a comparable scale for the model.
Now we are going to check whether there is any visible multicollinearity between the independent variables that could affect our multiple regression. Since the variables are not normally distributed, we will use the Spearman correlation.
df_4 <- df[c('EXPERIENCE','CITY','EDUCATION','KIDS_5')]
cor(df_4, method = 'spearman')
EXPERIENCE CITY EDUCATION KIDS_5
EXPERIENCE 1.00000000 0.02734453 0.08166274 -0.20686556
CITY 0.02734453 1.00000000 0.17150742 -0.03917689
EDUCATION 0.08166274 0.17150742 1.00000000 0.08951226
KIDS_5 -0.20686556 -0.03917689 0.08951226 1.00000000
Multicollinearity results: there is no strong correlation between the independent variables, so there is no need to inspect the p-values.
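Should we want to inspect the p-values anyway, they can be obtained together with the Spearman coefficients via Hmisc::rcorr (a sketch, assuming the Hmisc package is installed):
library(Hmisc)
rcorr(as.matrix(df_4), type = "spearman")  # $r holds the coefficients, $P the p-values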
We are going to go ahead and create the multiple regression.
model <- lm(normalize(WAGE) ~ normalize(EXPERIENCE) + normalize(CITY) + normalize(EDUCATION) + normalize(KIDS_5), data = df)
summary(model)
Call:
lm(formula = normalize(WAGE) ~ normalize(EXPERIENCE) + normalize(CITY) +
normalize(EDUCATION) + normalize(KIDS_5), data = df)
Residuals:
Min 1Q Median 3Q Max
-0.24165 -0.07207 -0.01772 0.04785 0.95351
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.065831 0.015778 -4.172 3.37e-05 ***
normalize(EXPERIENCE) 0.149613 0.024696 6.058 2.18e-09 ***
normalize(CITY) 0.002076 0.009158 0.227 0.820716
normalize(EDUCATION) 0.215739 0.023319 9.252 < 2e-16 ***
normalize(KIDS_5) -0.086709 0.025498 -3.401 0.000708 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1186 on 748 degrees of freedom
Multiple R-squared: 0.1673, Adjusted R-squared: 0.1629
F-statistic: 37.58 on 4 and 748 DF, p-value: < 2.2e-16
The associated p-value of the F-statistic is less than .05, indicating that the model is significant.
If we examine the p-values of the individual variables, we can see that EXPERIENCE, EDUCATION and KIDS_5 are all statistically significant (at the 95% level). The variable CITY, however, can be removed from the multiple regression.
# Multiple Regression with no CITY
model <- lm(normalize(WAGE) ~ normalize(EXPERIENCE) + normalize(EDUCATION) + normalize(KIDS_5), data = df)
summary(model)
Call:
lm(formula = normalize(WAGE) ~ normalize(EXPERIENCE) + normalize(EDUCATION) +
normalize(KIDS_5), data = df)
Residuals:
Min 1Q Median 3Q Max
-0.24325 -0.07178 -0.01762 0.04750 0.95435
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06498 0.01532 -4.242 2.49e-05 ***
normalize(EXPERIENCE) 0.14955 0.02468 6.060 2.16e-09 ***
normalize(EDUCATION) 0.21661 0.02298 9.425 < 2e-16 ***
normalize(KIDS_5) -0.08707 0.02543 -3.424 0.000652 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1186 on 749 degrees of freedom
Multiple R-squared: 0.1673, Adjusted R-squared: 0.164
F-statistic: 50.16 on 3 and 749 DF, p-value: < 2.2e-16
By removing the CITY variable, the adjusted R-squared only changed from 0.1629 to 0.164, so CITY evidently had little impact on the model. However, an R-squared this low means that only 16.4% of the variance in wages is explained by experience, education and having small children.
Multiple linear regression (on the min-max normalized variables):
WAGE = -0.06498 + 0.14955*EXPERIENCE + 0.21661*EDUCATION - 0.08707*KIDS_5
Variable with the greatest impact: EDUCATION.
Statistically significant variables (at 95%): EXPERIENCE, EDUCATION, KIDS_5.
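To illustrate how the fitted equation is used, here is a hand-computed prediction for a hypothetical case (the input values are illustrative, not from the source): 10 years of experience, 12 years of education and one child under 5. The predictors are first min-max scaled with the sample minima/maxima from summary(df_2), and the predicted normalized wage is mapped back to the original scale at the end:
exper_n <- (10 - 0) / (45 - 0)  # ~0.222
educ_n <- (12 - 5) / (17 - 5)   # ~0.583
kids_n <- (1 - 0) / (3 - 0)     # ~0.333
wage_n <- -0.06498 + 0.14955 * exper_n + 0.21661 * educ_n - 0.08707 * kids_n
wage_n                          # ~0.066 on the normalized [0, 1] scale
wage_n * (25 - 0) + 0           # ~1.6 in the original WAGE units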
Without doing any analysis, common sense suggests that the LFP-HOURS pair will have the strongest correlation, since working more hours necessarily implies participating in the labor force. However, let's verify this through the analysis.
First we are going to plot these pairs in order to get a visual sense of their correlations.
plot(df[c('LFP','AGE')], main = "LFP vs AGE")
plot(df[c('LFP','WAGE')], main = "LFP vs WAGE")
plot(df[c('LFP','HOURS')], main = "LFP vs HOURS")
We can easily observe that the variable LFP is dichotomous, whereas the rest are quantitative. For this reason we will use the point-biserial correlation for each pair, which is numerically identical to Pearson's correlation when one of the variables is binary (hence the cor.test() calls below).
x <- unlist(df[c('LFP')])
cor.test(x,unlist(df[c('AGE')]))
Pearson's product-moment correlation
data: x and unlist(df[c("AGE")])
t = -2.2132, df = 751, p-value = 0.02718
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.151075065 -0.009104645
sample estimates:
cor
-0.08049811
cor.test(x,unlist(df[c('WAGE')]))
Pearson's product-moment correlation
data: x and unlist(df[c("WAGE")])
t = 22.748, df = 751, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5943858 0.6791617
sample estimates:
cor
0.638708
cor.test(x,unlist(df[c('HOURS')]))
Pearson's product-moment correlation
data: x and unlist(df[c("HOURS")])
t = 30.254, df = 751, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7071441 0.7717269
sample estimates:
cor
0.7411454
Out of the three pairs of variables, the one with the strongest correlation is LFP vs HOURS, just as we anticipated.
The point-biserial correlation coefficient for this pair is 0.7411 (considerably higher than for the other two pairs), indicating a positive correlation: when LFP takes the value 1, HOURS tends to be higher.
Its p-value is below .05, so we reject the null hypothesis of zero correlation: the correlation is statistically significant. The confidence interval, which excludes zero, further confirms this.
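As a side check, the point-biserial coefficient can also be computed directly from the group means, which illustrates why cor.test() (Pearson) yields the point-biserial value when one of the variables is binary (a sketch):
m1 <- mean(df$HOURS[df$LFP == 1])                      # mean hours of participants
m0 <- mean(df$HOURS[df$LFP == 0])                      # mean hours of non-participants
s_n <- sd(df$HOURS) * sqrt((nrow(df) - 1) / nrow(df))  # population standard deviation
p <- mean(df$LFP)                                      # proportion of 1s
(m1 - m0) / s_n * sqrt(p * (1 - p))                    # ~0.741, matching cor(df$LFP, df$HOURS)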
First we are going to do an exploratory analysis and check whether the quantitative variables follow a normal distribution. From exercise 4 we already know that EDUCATION, WAGE, EXPERIENCE and KIDS_5 are not normally distributed, so we will focus on AGE, UNEM and HOURS.
par("mfcol"=c(3, 1))
hist(df$AGE, col="blue")
hist(df$UNEM, col="green")
hist(df$HOURS, col="red")
par("mfcol"=c(1, 1))
None of the variables appear to be normally distributed. Even so, we will run the Shapiro-Wilk test once more to confirm.
shapiro.test(df$AGE)
Shapiro-Wilk normality test
data: df$AGE
W = 0.96162, p-value = 3.829e-13
shapiro.test(df$UNEM)
Shapiro-Wilk normality test
data: df$UNEM
W = 0.92283, p-value < 2.2e-16
shapiro.test(df$HOURS)
Shapiro-Wilk normality test
data: df$HOURS
W = 0.80675, p-value < 2.2e-16
As expected, none of them follow a normal distribution. Knowing this, we will go ahead and rescale these variables using min-max scaling. This will not make their distributions normal, but it will put them on the same [0, 1] range as the rest of the variables.
par("mfcol"=c(3, 1))
hist(normalize(df$AGE), col="blue")
hist(normalize(df$UNEM), col="green")
hist(normalize(df$HOURS), col="red")
par("mfcol"=c(1, 1))
We are now going to see whether there is multicollinearity between the independent variables that could affect our logistic regression. Since the variables are not normally distributed, we will use the Spearman correlation.
df_6 <- df[c('LFP', 'AGE', 'EDUCATION','WAGE','UNEM','HOURS','KIDS_5','CITY','EXPERIENCE')]
df_6$AGE = normalize(df$AGE)
df_6$EDUCATION = normalize(df$EDUCATION)
df_6$WAGE = normalize(df$WAGE)
df_6$UNEM = normalize(df$UNEM)
df_6$HOURS = normalize(df$HOURS)
df_6$KIDS_5 = normalize(df$KIDS_5)
df_6$EXPERIENCE = normalize(df$EXPERIENCE)
library(Hmisc)  # provides rcorr(), which returns the correlation matrix together with p-values
# Note: rcorr() computes Pearson correlations by default; pass type = "spearman" for Spearman.
res <- rcorr(as.matrix(df_6))
res
LFP AGE EDUCATION WAGE UNEM HOURS KIDS_5 CITY EXPERIENCE
LFP 1.00 -0.08 0.19 0.64 -0.03 0.74 -0.21 -0.01 0.34
AGE -0.08 1.00 -0.12 -0.03 0.08 -0.03 -0.43 0.10 0.33
EDUCATION 0.19 -0.12 1.00 0.32 0.07 0.11 0.11 0.16 0.07
WAGE 0.64 -0.03 0.32 1.00 0.00 0.42 -0.12 0.07 0.25
UNEM -0.03 0.08 0.07 0.00 1.00 -0.06 -0.01 0.18 0.00
HOURS 0.74 -0.03 0.11 0.42 -0.06 1.00 -0.22 -0.02 0.40
KIDS_5 -0.21 -0.43 0.11 -0.12 -0.01 -0.22 1.00 -0.04 -0.19
CITY -0.01 0.10 0.16 0.07 0.18 -0.02 -0.04 1.00 0.01
EXPERIENCE 0.34 0.33 0.07 0.25 0.00 0.40 -0.19 0.01 1.00
n= 753
P
LFP AGE EDUCATION WAGE UNEM HOURS KIDS_5 CITY EXPERIENCE
LFP 0.0272 0.0000 0.0000 0.4311 0.0000 0.0000 0.8658 0.0000
AGE 0.0272 0.0009 0.3436 0.0345 0.3642 0.0000 0.0081 0.0000
EDUCATION 0.0000 0.0009 0.0000 0.0478 0.0036 0.0028 0.0000 0.0692
WAGE 0.0000 0.3436 0.0000 0.9972 0.0000 0.0007 0.0728 0.0000
UNEM 0.4311 0.0345 0.0478 0.9972 0.0983 0.8042 0.0000 0.9033
HOURS 0.0000 0.3642 0.0036 0.0000 0.0983 0.0000 0.6568 0.0000
KIDS_5 0.0000 0.0000 0.0028 0.0007 0.8042 0.0000 0.2426 0.0000
CITY 0.8658 0.0081 0.0000 0.0728 0.0000 0.6568 0.2426 0.7583
EXPERIENCE 0.0000 0.0000 0.0692 0.0000 0.9033 0.0000 0.0000 0.7583
There are indeed correlations between independent variables that exceed |.30|, with p-values of essentially 0, indicating that these correlations are statistically significant. This can be mitigated as variables are removed during the construction of the model.
HOURS has the strongest correlation with LFP, followed by WAGE; the remaining variables show weaker correlations with it. UNEM and CITY have p-values above .05, meaning their correlations with LFP are not statistically significant, so they can be excluded from the model.
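A complementary way to quantify multicollinearity among the predictors is the variance inflation factor (a sketch, assuming the car package is installed; VIFs depend only on the predictor matrix, so an auxiliary lm() fit is enough here):
library(car)
vif(lm(LFP ~ AGE + EDUCATION + WAGE + UNEM + HOURS + KIDS_5 + CITY + EXPERIENCE, data = df_6))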
Now we are going to check whether the target variable is class-biased, i.e., whether its 0s and 1s are roughly equally represented.
table(df_6$LFP)
0 1
325 428
Since the two classes are of similar size (325 vs 428), we consider there is no class bias and can continue with our analysis.
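For reference, the class proportions can be obtained with prop.table():
prop.table(table(df_6$LFP))  # roughly 0.43 vs 0.57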
We are going to go ahead and create our train/test samples, and then fit the model.
df_6 <- df_6[c('LFP', 'AGE', 'EDUCATION','WAGE','HOURS','KIDS_5','EXPERIENCE')]
bound <- floor((nrow(df_6)/4)*3)    # 75% of the rows go to the training set
df_6 <- df_6[sample(nrow(df_6)), ]  # shuffle the rows (call set.seed() beforehand for a reproducible split)
df_6.train <- df_6[1:bound, ]
df_6.test <- df_6[(bound+1):nrow(df_6), ]
mylogit <- glm(LFP ~ EDUCATION + WAGE + HOURS + KIDS_5 + EXPERIENCE + AGE, data=df_6.train, family=binomial(link="logit"), control = list(maxit = 100))
glm.fit: fitted probabilities numerically 0 or 1 occurred
While fitting the model, a warning appeared.
This warning message indicates that some variable perfectly separates the zeroes and ones of the target variable. Since this exercise is about building a logistic regression rather than switching to another kind of model, we will remove the offending variable(s) from the equation.
To find the variable(s) causing this perfect separation, we run a separate logistic regression for each X-Y pair and remove whichever ones produce the warning.
mylogit2 <- glm(LFP ~ EDUCATION, data = df_6.test, family = "binomial", control = list(maxit = 100))
mylogit2 <- glm(LFP ~ WAGE, data = df_6.test, family = "binomial", control = list(maxit = 100))
glm.fit: fitted probabilities numerically 0 or 1 occurred
mylogit2 <- glm(LFP ~ HOURS, data = df_6.test, family = "binomial", control = list(maxit = 100))
glm.fit: fitted probabilities numerically 0 or 1 occurred
mylogit2 <- glm(LFP ~ KIDS_5, data = df_6.test, family = "binomial", control = list(maxit = 100))
mylogit2 <- glm(LFP ~ EXPERIENCE, data = df_6.test, family = "binomial", control = list(maxit = 100))
mylogit2 <- glm(LFP ~ AGE, data = df_6.test, family = "binomial", control = list(maxit = 100))
We can observe that the variables producing the warning are WAGE and HOURS, so we will remove them, fit the logistic regression with the remaining variables and obtain the predicted scores (the cross-tabulation below shows why these two variables separate the classes).
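As a sketch of that check: in this data set HOURS and WAGE are recorded as 0 for non-participants, so, for example, HOURS > 0 reproduces LFP exactly:
table(LFP = df_6$LFP, WORKED = df_6$HOURS > 0)  # every LFP = 1 row has HOURS > 0, and vice versa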
mylogit <- glm(LFP ~ EDUCATION + KIDS_5 + EXPERIENCE + AGE , data=df_6.train, family=binomial, control = list(maxit = 100))
predicted <- plogis(predict(mylogit, df_6.test))
summary(mylogit)
Call:
glm(formula = LFP ~ EDUCATION + KIDS_5 + EXPERIENCE + AGE, family = binomial,
data = df_6.train, control = list(maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4286 -0.9645 0.4685 0.9008 2.0606
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7324 0.4028 -1.818 0.069046 .
EDUCATION 2.0180 0.5548 3.637 0.000276 ***
KIDS_5 -3.8559 0.6486 -5.945 2.77e-09 ***
EXPERIENCE 5.7521 0.6985 8.235 < 2e-16 ***
AGE -2.9286 0.4608 -6.355 2.08e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 771.05 on 563 degrees of freedom
Residual deviance: 622.88 on 559 degrees of freedom
AIC: 632.88
Number of Fisher Scoring iterations: 4
Next we are going to calculate the best cutoff for the model. Normally it would be 0.5 (since the outcome is coded as 0s and 1s), but finding the optimal cutoff may improve the results.
library(InformationValue)
optCutOff <- optimalCutoff(df_6.test$LFP, predicted)
optCutOff
[1] 0.4370756
And now we will get the percentage mismatch between predicted and actual values.
misClassError(df_6.test$LFP, predicted, threshold = optCutOff)
[1] 0.2116
The ROC curve.
plotROC(df_6.test$LFP, predicted)
And the confusion matrix.
confusionMatrix(df_6.test$LFP, predicted, threshold = optCutOff)
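The headline figures quoted below can also be recomputed by hand from the predicted classes (a sketch in base R; the exact numbers depend on the random train/test split):
pred_class <- ifelse(predicted > optCutOff, 1, 0)
tab <- table(actual = df_6.test$LFP, predicted = pred_class)
sum(diag(tab)) / sum(tab)        # accuracy
tab["1", "1"] / sum(tab[, "1"])  # precision: share of predicted 1s that are truly 1
tab["1", "1"] / sum(tab["1", ])  # sensitivity: share of actual 1s correctly predicted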
We can observe the difference between the null deviance and the residual deviance in our model's summary, showing that our model does indeed perform better than a null model.
The percentage mismatch between predicted and actual values is low, which is a good indicator.
The ROC AUC is 0.8218, indicating good discrimination.
The confusion matrix shows us relatively good results, with an accuracy of 78.83%, precision of 85.98% and sensitivity of 78.63%.
In general, the labor force participation of married women in 1975 can largely be explained by their education, experience, age, and whether they had children under 5 years old.